Also known as the Erlang virtual machine. You can read about where the name came from [here](https://en.wikipedia.org/wiki/BEAM_(Erlang_virtual_machine)), but the important bit is that the BEAM is the VM that runs all Erlang, Elixir, Gleam, etc. code. It's part of the Erlang Run-Time System (ERTS), and it often gets the credit for what makes Erlang and Elixir magical, even though many other parts are integral to that experience.
Talking specifically from the point of view of the Elixir community, you could think of our [platform core values](https://www.youtube.com/watch?v=2wZ1pCpJUIM) as:
- Availability
- Composability
- Simplicity
- Velocity
To a lesser extent, I think it also does a pretty good job at Approachability and Operability. But the original four can be seen both in the makeup of the whole system, and in much of the community infrastructure.
Also, Elixir in particular has great interoperability with Erlang and Gleam, since they all compile down to BEAM bytecode.
## The Process
At its core, the BEAM is made up of processes. A process is not the same thing as an OS Thread, a Channel, or even a Green Thread, though it is somewhat similar to the latter. All code runs inside processes.
Processes are completely isolated from each other, and can only communicate via message passing. They are so lightweight that you can run millions simultaneously with little overhead. Also, each individual process is sequential, but since you can start up as many others as you want, you can achieve easy concurrency with them.
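To make "as many as you want" a bit more concrete, here is a minimal sketch that spawns 10,000 processes, has each one do a tiny piece of work, and collects the answers by message. The worker count, the `{:result, n, square}` message shape, and the 5 second timeout are all arbitrary choices for illustration.

```elixir
parent = self()

# Spawn 10_000 isolated processes; each squares one number and reports back.
pids =
  for n <- 1..10_000 do
    spawn(fn -> send(parent, {:result, n, n * n}) end)
  end

# Collect one message per process. Arrival order across different processes
# is not guaranteed, so we add the results up instead of relying on order.
total =
  Enum.reduce(1..length(pids), 0, fn _, acc ->
    receive do
      {:result, _n, square} -> acc + square
    after
      5_000 -> raise "timed out waiting for a worker"
    end
  end)

IO.puts("sum of squares from #{length(pids)} processes: #{total}")
```

Each worker is sequential on its own; the concurrency comes purely from having many of them.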
Processes have three basic properties:
- A Process ID, also known as a PID. This is a distinct type in Elixir, printed as something like `#PID<0.110.0>`.
- A message mailbox. This is an ordered queue of received messages for the process.
- A set of 'linked' processes. If any of those crashes, this process receives an exit signal, and unless it traps exits and handles that signal, it too will crash.
The mailbox is the mechanism through which a process can receive messages. "Sending" a message is just putting a new message in the mailbox. Any given message is an Erlang 'term', which is just a fancy way of saying an immutable value.
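The three properties above can be poked at directly from a script or `iex`. This is only a sketch: the message shapes (`{:greet, from, name}` and friends) and the `:boom` exit reason are made up for the example.

```elixir
# 1. Every process has a PID.
pid =
  spawn(fn ->
    receive do
      {:greet, from, name} -> send(from, {:greeting, "hello, " <> name})
    end
  end)

true = is_pid(pid)

# 2. send/2 puts a message in the target's mailbox; receive takes one out.
send(pid, {:greet, self(), "world"})

greeting =
  receive do
    {:greeting, text} -> text
  after
    1_000 -> :timeout
  end

# Messages from one sender to one receiver sit in the mailbox in send order.
send(self(), :first)
send(self(), :second)
{:messages, queued} = Process.info(self(), :messages)

# 3. Links: by trapping exits, a linked crash arrives as a message
#    instead of taking this process down with it.
Process.flag(:trap_exit, true)
crasher = spawn_link(fn -> exit(:boom) end)

exit_reason =
  receive do
    {:EXIT, ^crasher, reason} -> reason
  after
    1_000 -> :timeout
  end

IO.inspect({greeting, queued, exit_reason})
```

Without the `Process.flag(:trap_exit, true)` line, the linked crash would terminate the calling process too, which is exactly the behavior supervisors build on.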
*Everything* in Erlang/Elixir is immutable. The BEAM might mutate something behind the scenes, but as far as your code is concerned, it simply isn't possible to mutate anything.
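A quick sketch of what that looks like in practice. The variable names are just for illustration; the point is that "changing" something always produces a new value, and rebinding a name never touches the old one.

```elixir
list = [1, 2, 3]

# Functions return new values; the original list is untouched.
longer = [0 | list]
doubled = Enum.map(list, &(&1 * 2))

# Rebinding a name points it at a new value. Anything that still
# refers to the old value keeps seeing the old value.
original = %{count: 1}
snapshot = original
original = Map.put(original, :count, 2)

IO.inspect({list, longer, doubled, snapshot, original})
```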
## The Scheduler
So, how can we have millions of simultaneous, isolated processes on a machine with only 8 cores? Or 4? Or 1? The scheduler is the answer.
All code running in the BEAM is managed by the scheduler, so that it can ensure that no individual process can block the rest of the application from running.
The way it does this is with run queues ordered by priority, and by tracking not only each process but how much work the currently running process has done. That work is counted in 'reductions' (roughly, function calls) rather than in wall-clock time.
Using this information, it can pause a process mid-run to let another process run some of its code, making hard-locking almost impossible in the BEAM.
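Here's a small sketch of that preemption at work. It starts twice as many busy-looping processes as there are scheduler threads, none of which ever waits or yields on its own, and then checks that a perfectly ordinary process still gets to answer a message. The `Busy` module, the `:ping`/`:pong` messages, and the 5 second timeout are hypothetical names and numbers picked for the demo.

```elixir
defmodule Busy do
  # A tight loop that never waits on a message or yields voluntarily.
  def spin(n), do: spin(n + 1)
end

# Start more spinning processes than there are scheduler threads.
spinners =
  for _ <- 1..(System.schedulers_online() * 2) do
    spawn(fn -> Busy.spin(0) end)
  end

# A well-behaved process that just answers one ping.
echo =
  spawn(fn ->
    receive do
      {:ping, from} -> send(from, :pong)
    end
  end)

send(echo, {:ping, self()})

# If the spinners could hog the schedulers, this would hit the timeout.
reply =
  receive do
    :pong -> :pong
  after
    5_000 -> :starved
  end

# Clean up the busy loops.
Enum.each(spinners, &Process.exit(&1, :kill))
IO.inspect(reply)
```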
One weakness in this approach is that if you ever call out to another programming language (like C or [Rust](https://github.com/rusterlium/rustler)) using Native Implemented Functions (NIFs), you've got two specific limitations:
- A NIF cannot crash without bringing down the entire VM. Where Erlang/Elixir code can crash safely and be handled in the supervision tree, native code runs outside those guarantees, so a segfault in a NIF takes every process down with it.
- NIFs must be limited in total time per run, since the scheduler cannot pause native code and other BEAM processes expect to get control back relatively soon. Longer-running NIFs can be flagged to run on the separate 'dirty' schedulers instead.
## Open Telecom Platform
Oh, what a scary term for such an obvious set of things. It's a set of useful middleware, libraries, and tools, all written in Erlang. Nothing about it is specific to telecom applications. That was just Ericsson's original branding for it, back when Moore's law was still delivering and there wasn't the same sort of need for distributed systems as we have today.
The most important bits of OTP are GenServers and Supervisors.
### GenServers
A GenServer is a process like any other, with built-in tools for keeping state, responding to requests, and executing code asynchronously. Additionally, it includes a bunch of tooling for tracing and error reporting.
The benefit of GenServers is that they can safely hold state for your application. Need a place to store database connections, so you're not constantly opening new ones? Need to keep track of who's been seen in the last 20 minutes? Need to poll an endpoint every 60 seconds for updates? Here you go.
Most long-lived state in Elixir is stored in some kind of GenServer. Implementing one is as simple as specifying how it should start (via an `init` callback) and implementing the core functionality, generally as `handle_call` or `handle_cast` callbacks. Calls are synchronous and casts are asynchronous, but this doesn't mean you're going to be using `await` with casts. Instead, a cast returns immediately without a reply, and if you expect an answer later, the server sends it as an ordinary message that you receive later.
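As a minimal sketch of that shape, here's a counter. `Counter`, `increment/1`, and `value/1` are hypothetical names for this example; `GenServer.start_link/2`, `GenServer.call/2`, `GenServer.cast/2`, and the `init`, `handle_call`, and `handle_cast` callbacks are the standard pieces.

```elixir
defmodule Counter do
  use GenServer

  # Client API: thin wrappers that run in the caller's process.
  def start_link(initial), do: GenServer.start_link(__MODULE__, initial)
  def increment(pid), do: GenServer.cast(pid, :increment)
  def value(pid), do: GenServer.call(pid, :value)

  # Server callbacks: these run inside the GenServer process.
  @impl true
  def init(initial), do: {:ok, initial}

  # cast: asynchronous, no reply, just a new state.
  @impl true
  def handle_cast(:increment, count), do: {:noreply, count + 1}

  # call: synchronous, the caller blocks until it gets the reply.
  @impl true
  def handle_call(:value, _from, count), do: {:reply, count, count}
end

{:ok, counter} = Counter.start_link(0)
:ok = Counter.increment(counter)
:ok = Counter.increment(counter)

# The server handles its mailbox in order, so both casts are
# processed before this call is answered.
current = Counter.value(counter)
IO.inspect(current)
```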
### Supervisors
These are a special kind of GenServer, whose state is made up of the child processes they are linked to and how to interact with them. Their most essential role is noticing when one of their children crashes.
Supervisors can have their children specified from the beginning, or added and removed as they go. Usually, however, it's the Supervisor's job to start and stop its children, starting and linking them with whatever initialization arguments are required.
When specifying or creating a child, you can say whether it should always be restarted (`:permanent`), restarted only if it terminated abnormally (`:transient`), or never restarted (`:temporary`). The supervisor also has a restart strategy (`:one_for_one`, `:one_for_all`, or `:rest_for_one`), which makes it easy to handle children that depend on each other.
However, simply restarting in an endless loop wouldn't be helpful. Instead, after restarting, the supervisor watches. If its children need more restarts than allowed (by default 3) within a time limit (by default 5 seconds)... the supervisor itself will crash.
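A rough sketch of a supervisor bringing a child back, assuming a trivial worker. `Worker` and `:some_state` are hypothetical; the options passed to `Supervisor.start_link/2` simply spell out the defaults mentioned above. Expect a supervisor report in the log when the child is killed.

```elixir
defmodule Worker do
  use GenServer

  # Registered under its module name so we can look it up after a restart.
  def start_link(arg), do: GenServer.start_link(__MODULE__, arg, name: __MODULE__)

  @impl true
  def init(arg), do: {:ok, arg}
end

children = [
  # :permanent (always restart) is already the default; shown for clarity.
  Supervisor.child_spec({Worker, :some_state}, restart: :permanent)
]

{:ok, sup} =
  Supervisor.start_link(children,
    strategy: :one_for_one,
    max_restarts: 3,
    max_seconds: 5
  )

first_pid = Process.whereis(Worker)

# Simulate a crash.
Process.exit(first_pid, :kill)

# Poll briefly until the supervisor has started a replacement.
second_pid =
  Enum.find_value(1..100, fn _ ->
    case Process.whereis(Worker) do
      pid when is_pid(pid) and pid != first_pid ->
        pid

      _ ->
        Process.sleep(10)
        nil
    end
  end)

IO.inspect({first_pid, second_pid, Process.alive?(sup)})
```

The replacement is a brand new process started from the same child spec, so it comes back with its initial state, not whatever the old one had accumulated.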
## Supervision trees, or "Let it crash"
A BEAM application is effectively made up of trees of supervisors and their workers. If a worker crashes enough times in a time period, it'll crash its supervisor. If that supervisor in turn crashes enough times in a time period, its own supervisor will crash. And so on and so forth, until finally the top of the tree gives up and the application itself goes down, usually taking the BEAM node with it.
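One level of that escalation can be sketched like this. The script plays the role of the parent: it traps exits so it can observe the supervisor giving up instead of going down with it. `Flaky`, the `max_restarts: 2` limit, and the 50 ms pause are hypothetical values chosen to keep the demo short, and the log will fill with supervisor reports along the way.

```elixir
defmodule Flaky do
  use GenServer

  def start_link(_), do: GenServer.start_link(__MODULE__, nil, name: __MODULE__)

  @impl true
  def init(_), do: {:ok, nil}
end

# Observe the supervisor's exit as a message rather than crashing with it.
Process.flag(:trap_exit, true)

{:ok, sup} =
  Supervisor.start_link([Flaky],
    strategy: :one_for_one,
    max_restarts: 2,
    max_seconds: 5
  )

ref = Process.monitor(sup)

# Kill the worker, then give the supervisor a moment to restart it.
kill_worker = fn ->
  case Process.whereis(Flaky) do
    nil -> :not_running
    pid -> Process.exit(pid, :kill)
  end

  Process.sleep(50)
end

# The third crash inside the 5 second window exceeds max_restarts: 2.
for _ <- 1..5, do: kill_worker.()

sup_exit =
  receive do
    {:DOWN, ^ref, :process, ^sup, reason} -> reason
  after
    1_000 -> :still_running
  end

IO.inspect(sup_exit)
```

In a real tree, the supervisor above this one would now count that exit against its own restart limit, and so on up.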
But think about it: How often does an application of yours make it into production while immediately crashing? Pretty rare, right? And what do you do, when it finally does? Well, you restart it, of course.
So, follow this to its logical conclusion. What if we just... ignored the majority of these errors? If the system restarts itself back into a good state every time, do we actually care when part of it goes wrong?
Sometimes you do. But a lot of times, you don't. And even on those times, you often have more time to fix it, since the system will be in a good state for longer than it would be with a normal application.
This isn't to say that Erlang and Elixir skimp on dealing with errors and observability. Quite the contrary.
## Observability & Debugging Tools
For one, there are incredibly powerful tools in the VM itself:
![:observer.start()](https://miro.medium.com/max/962/0*MB6liPn2huog25GU.png)
By typing `:observer.start()` into the REPL (`iex` in your terminal) on Mac/Linux, you'll open that window (getting this working on Windows is much harder, but keep reading to see why this doesn't matter as much as you'd think). From there, you can get all the system details, CPU/memory usage, statistics, charts, and more. You can explore the entire supervision tree, or watch specific processes, including their stack trace and state. Live.
- **On Windows?** You can use [`observer_cli`](https://github.com/zhongwencool/observer_cli).
- **On a different machine with Elixir?** You can start a REPL, [connect to the node](https://stackoverflow.com/a/46231900), and explore it on the same observer from before!
- **On a different machine WITHOUT Elixir, or just don't want the above hassle?** Stick [Phoenix Live Dashboard](https://github.com/phoenixframework/phoenix_live_dashboard) on there. Tahdah.
All the stats available there are available programmatically too. It's dead simple to write code to access them.
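For instance, a few of the numbers you'd see in the observer can be pulled with plain function calls. This is just a small sample of what's there, and the variable names are mine.

```elixir
# Whole-VM totals, in bytes, as a keyword list (:total, :processes, :ets, ...).
memory = :erlang.memory()
total_bytes = memory[:total]

# How many processes exist right now, and how many scheduler threads run them.
process_count = :erlang.system_info(:process_count)
schedulers = System.schedulers_online()

# Per-process details: memory footprint, mailbox length, work done so far.
info = Process.info(self(), [:memory, :message_queue_len, :reductions])

IO.inspect(
  total_bytes: total_bytes,
  process_count: process_count,
  schedulers: schedulers,
  self: info
)
```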
### Telemetry
But what if you just want general observability you can plug into existing platforms? Well, we've got the best support around, with the `:telemetry` system.
Essentially, `:telemetry` is a platform-wide event library specifically built for application telemetry uses. Anytime a library author, application author, or anyone else wants to expose some of their telemetry data (like logs, metrics, or traces), they just emit it as an event.
Then, when you want to do something based on that telemetry, whether it's sending it to a Prometheus server, Honeycomb, or maybe just emailing your boss, you just attach a handler that says which events to listen for and what to do when one arrives. One caveat: handlers run synchronously in the process that emits the event, so keep them quick and hand anything slow off to another process. That way it won't affect how your application runs.
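A minimal sketch of both sides, emitting and handling, assuming the `telemetry` package can be fetched from Hex via `Mix.install/1` (so this one needs network access the first time). The event name `[:my_app, :request, :stop]`, the handler ID, the `DemoHandler` module, and the measurement and metadata contents are all made up for the example.

```elixir
Mix.install([{:telemetry, "~> 1.0"}])

defmodule DemoHandler do
  # Telemetry calls this with the event name, measurements, metadata,
  # and whatever config was given at attach time. Here it only forwards
  # the event to another process and returns.
  def handle_event(event, measurements, metadata, %{notify: pid}) do
    send(pid, {:telemetry_event, event, measurements, metadata})
  end
end

# Listening side: attach a handler to one event name.
:ok =
  :telemetry.attach(
    "demo-request-handler",
    [:my_app, :request, :stop],
    &DemoHandler.handle_event/4,
    %{notify: self()}
  )

# Emitting side: a library or app fires the event with its data.
:telemetry.execute([:my_app, :request, :stop], %{duration: 42}, %{path: "/users"})

received =
  receive do
    {:telemetry_event, event, measurements, metadata} -> {event, measurements, metadata}
  after
    1_000 -> :nothing
  end

IO.inspect(received)
```

Forwarding the event as a message is one simple way to get slow work (HTTP calls, emails) out of the emitting code path.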
## How's it perform?
Like I said before, literally millions of individual, isolated processes running per machine. And that isn't with each process idle.
- How about [2 million websocket connections](https://www.phoenixframework.org/blog/the-road-to-2-million-websocket-connections) with actual traffic running through?
- How about [WhatsApp supporting 465m users with 550 servers, with only about 10 engineers covering both dev and ops](http://highscalability.com/blog/2014/3/31/how-whatsapp-grew-to-nearly-500-million-users-11000-cores-an.html)?
- What about supporting over 10,000 users in a single multiplayer session for a flight simulator? [Did I mention it was built in 6 months by one developer who had never used Elixir before?](https://elixir-lang.org/blog/2021/07/29/bootstraping-a-multiplayer-server-with-elixir-at-x-plane/)
- Depended on by WhatsApp, Discord, Pinterest, Moz, Grindr, Bleacher Report, Postmates, Pepsi, Slack, Procore, and thousands more.
## What about the community?
- The best database wrapper/query generator ever made, [Ecto](https://hexdocs.pm/ecto/getting-started.html)(By the way, it supports [CONCURRENT DATABASE TESTS????](https://hexdocs.pm/ecto_sql/Ecto.Adapters.SQL.Sandbox.html))
- A web framework so fast you'll learn what the symbol is for a microsecond just from your response times, [Phoenix](https://hexdocs.pm/phoenix/overview.html#content)
- The best GraphQL API library in any language, [Absinthe](https://hexdocs.pm/absinthe/overview.html)
- A library to make nearly any piece of work as concurrent as you want, with almost zero effort (literally just replace the module name), [Flow](https://hexdocs.pm/flow/Flow.html)
- Kafka without the bullshit and with a lot more beauty, [Broadway](https://hexdocs.pm/broadway/Broadway.html)(look at that feature list!)
- Automatic cluster formation/healing with the simplest configuration around via [libcluster](https://github.com/bitwalker/libcluster)
- Automatic state failover across your cluster, making state continuing across websocket connections a breeze with [Horde](https://hexdocs.pm/horde/readme.html)
- Tools for data science and machine learning that can rival Python and Julia, [Nx](https://github.com/elixir-nx/nx)
I can go on.
## Sounds too good to be true.
Yeah, but in effect, this is what happens when you have a small, tight-knit community reliably working on something for over 30 years in Erlang, only to find out the kinds of problems modern systems are plagued with were what Erlang was built to handle from the first moment.
It isn't perfect for everything though.
- It'll lose in base memory usage to C/C++, Rust, Golang, Python, JavaScript, and some Java applications. However, [it also has great tools for per-process memory management](https://hexdocs.pm/elixir/Process.html#flag/2).
- It'll lose in raw processing to C/C++, Rust, or Golang. However, if it's data-sciency, once again [Nx](https://github.com/elixir-nx/nx) keeps Elixir in the running beside Python and Julia, the best-in-class.
- It isn't statically typed the way Rust or Haskell are. In fact, it's (strongly) dynamically typed!
- There are fewer libraries than many other languages have. However, the basic tools are better than most languages, so making up that difference is usually very easy.
- The tools are so powerful and easy, it becomes simple to make big mistakes. I've got stories.
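On the memory point in the list above, the linked `Process.flag/2` includes a `:max_heap_size` option, which lets you cap a single process so one runaway can't eat the whole machine. Here's a sketch, with a cap of one million words and a ten-million-element list picked arbitrarily so the limit is clearly exceeded.

```elixir
{pid, ref} =
  spawn_monitor(fn ->
    # Cap this process's heap (size is in words, not bytes). With kill: true
    # the VM kills the process once the cap is exceeded; error_logger: false
    # keeps the demo quiet.
    Process.flag(:max_heap_size, %{size: 1_000_000, kill: true, error_logger: false})

    # Try to build something far larger than the cap allows.
    big = Enum.to_list(1..10_000_000)
    length(big)
  end)

# Only the greedy process dies; we just get told about it.
outcome =
  receive do
    {:DOWN, ^ref, :process, ^pid, reason} -> reason
  after
    10_000 -> :timeout
  end

IO.inspect(outcome)
```

Put a process like that under a supervisor and a memory blowup becomes one more crash that gets restarted into a good state.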
At the end of the day, it isn't going to break the laws of physics for you (distributed systems), and it isn't going to help you make perfect software (formal systems). But it'll help you get the job done more reliably, with less work, than anything else out there.
And for what it's worth, I didn't even mention the particulars of Elixir the language at all- and it's my favorite language, by far.